Analysis of data for Dr. Stefano Allesina’s workshop: “A Skeptic’s Guide to Scientific Writing.”
For the most part, I am going to focus only on Articles.
Dr. Allesina visualized some of the data in a time-series format here. I am just going to look at the data regardless of time.
Filtered out “0” values.
Melissa brought up an interesting question about how closely related the number of views and the number of citations are. Stefano mentioned that for some types of documents they may not be closely related at all.
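One way to put a number on that relationship is a rank correlation, which is less sensitive to the heavy right skew typical of view and citation counts. The sketch below uses made-up numbers: `dat_toy` and the column name `num_views` are assumptions for illustration, not the workshop data.

```r
# Hypothetical toy data; `num_views` is an assumed column name.
dat_toy <- data.frame(
  num_views     = c(120, 450, 80, 3000, 950, 60),
  num_citations = c(2, 10, 0, 45, 12, 1)
)

# Spearman's rho compares ranks, so a few very large counts do not dominate.
rho <- cor(dat_toy$num_views, dat_toy$num_citations, method = "spearman")

# Pearson correlation on the log(x + 1) scale, matching the transformation
# used for citations in the models.
r_log <- cor(log(dat_toy$num_views + 1), log(dat_toy$num_citations + 1))

print(c(spearman = rho, pearson_log = r_log))
```

Running the same two calls separately for each document type would show whether the relationship really does differ between types.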
Call:
glm(formula = log(num_citations + 1) ~ num_words_title, data = dat)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9485  -0.8668   0.1172   0.9335   5.2022
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      3.001186   0.050745  59.142  < 2e-16 ***
num_words_title -0.026355   0.003891  -6.774 1.35e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 1.834748)
Null deviance: 13241 on 7172 degrees of freedom
Residual deviance: 13157 on 7171 degrees of freedom
AIC: 24713
Number of Fisher Scoring iterations: 2
[1] "p-value"
    (Intercept) num_words_title
   0.000000e+00    1.353578e-11
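Because the response is `log(num_citations + 1)`, the slope is easier to read after exponentiating it. The sketch below only does that arithmetic on the estimate printed above. Strictly, the percentage applies to `num_citations + 1`, not to the raw count, so treat it as an approximation.

```r
# Estimate for num_words_title copied from the summary above.
beta <- -0.026355

# Multiplicative change in (num_citations + 1) per additional title word,
# expressed as a percentage.
pct_change_per_word <- (exp(beta) - 1) * 100

# The same calculation for a title that is ten words longer.
pct_change_ten_words <- (exp(10 * beta) - 1) * 100

print(round(c(one_word = pct_change_per_word,
              ten_words = pct_change_ten_words), 1))
```

In this model each extra title word is associated with roughly 2.6% fewer (citations + 1). This is an association only, and the model does not yet account for publication year.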
Call:
glm(formula = log(num_citations + 1) ~ as.factor(year) + num_words_title,
data = dat)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.1902  -0.5138  -0.0089   0.5175   4.8916
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.695139   0.109791  33.656  < 2e-16 ***
as.factor(year)2006  0.507269   0.130068   3.900 9.71e-05 ***
as.factor(year)2007  0.345912   0.122713   2.819 0.004833 **
as.factor(year)2008  0.243100   0.119650   2.032 0.042215 *
as.factor(year)2009  0.212868   0.116566   1.826 0.067868 .
as.factor(year)2010  0.044285   0.115208   0.384 0.700700
as.factor(year)2011 -0.064313   0.115291  -0.558 0.576980
as.factor(year)2012 -0.268353   0.113519  -2.364 0.018108 *
as.factor(year)2013 -0.395434   0.113180  -3.494 0.000479 ***
as.factor(year)2014 -0.565502   0.112896  -5.009 5.60e-07 ***
as.factor(year)2015 -0.701157   0.112509  -6.232 4.87e-10 ***
as.factor(year)2016 -0.913665   0.112968  -8.088 7.08e-16 ***
as.factor(year)2017 -1.160244   0.112953 -10.272  < 2e-16 ***
as.factor(year)2018 -1.576806   0.113160 -13.934  < 2e-16 ***
as.factor(year)2019 -2.121371   0.112215 -18.904  < 2e-16 ***
as.factor(year)2020 -3.107982   0.111647 -27.838  < 2e-16 ***
as.factor(year)2021 -3.609822   0.144289 -25.018  < 2e-16 ***
num_words_title     -0.006114   0.002517  -2.429 0.015158 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.7517154)
Null deviance: 13241.2 on 7172 degrees of freedom
Residual deviance: 5378.5 on 7155 degrees of freedom
AIC: 18329
Number of Fisher Scoring iterations: 2
[1] "p-value"
        (Intercept) as.factor(year)2006 as.factor(year)2007 as.factor(year)2008
      1.174960e-230        9.705879e-05        4.832592e-03        4.221465e-02
as.factor(year)2009 as.factor(year)2010 as.factor(year)2011 as.factor(year)2012
       6.786827e-02        7.006998e-01        5.769799e-01        1.810777e-02
as.factor(year)2013 as.factor(year)2014 as.factor(year)2015 as.factor(year)2016
       4.789688e-04        5.600388e-07        4.865478e-10        7.077134e-16
as.factor(year)2017 as.factor(year)2018 as.factor(year)2019 as.factor(year)2020
       1.396407e-24        1.448490e-43        8.071434e-78       5.115671e-162
as.factor(year)2021     num_words_title
      1.692719e-132        1.515838e-02
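Adding year as a factor lowers the AIC from 24713 to 18329 and shrinks the title-length estimate from about -0.026 to about -0.006. That pattern is what one would expect if newer papers have simply had less time to accumulate citations. The sketch below shows one way to compare two such nested models. It runs on simulated data, so `sim` and every number it produces are made up and say nothing about the real dataset.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the real data (all values are invented).
sim <- data.frame(
  year            = sample(2005:2021, n, replace = TRUE),
  num_words_title = rpois(n, 12)
)
# Citations decay with publication year and ignore title length entirely.
sim$num_citations <- rpois(n, lambda = exp(3 - 0.2 * (sim$year - 2005)))

# Same two model formulas as in the output above.
m_title <- glm(log(num_citations + 1) ~ num_words_title, data = sim)
m_year  <- glm(log(num_citations + 1) ~ as.factor(year) + num_words_title,
               data = sim)

# Lower AIC indicates the better trade-off between fit and complexity.
print(AIC(m_title, m_year))

# F test for whether the year terms improve on the title-only model.
print(anova(m_title, m_year, test = "F"))
```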
[1] "Number of titles with a colon"
[1] 1442
[1] "Proportion of titles with a colon"
[1] 0.2010316
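A count and proportion like the ones above can be produced with a fixed-string match on each title. This is a minimal sketch on three made-up titles, and the vector name `titles` is an assumption. The last lines check that the reported proportion agrees with 1442 colon titles out of 7173 rows (the 7172 null degrees of freedom plus one).

```r
# Hypothetical example titles, not taken from the dataset.
titles <- c("Networks: a primer",
            "A short title",
            "Birds, bees: and trees: a review")

# TRUE for every title containing at least one colon.
has_colon <- grepl(":", titles, fixed = TRUE)

n_colon    <- sum(has_colon)   # number of titles with a colon
prop_colon <- mean(has_colon)  # proportion of titles with a colon

# Consistency check against the figures reported above.
reported_prop <- 1442 / 7173
print(round(reported_prop, 7))
```

A title with two colons still counts once here, which is one reason a title-level count can be smaller than a count of colon characters.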
# A tibble: 107 x 2
value n
<chr> <int>
1 " " 84428
2 " " 1
3 "_" 2
4 "-" 3915
5 "–" 13
6 "—" 14
7 "," 565
8 ";" 9
9 ":" 1445
10 "!" 2
# … with 97 more rows
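A per-character tally like the table above can be built by splitting every title into single characters and counting them. The sketch below does this in base R on two made-up titles. The names `titles` and `char_counts` are assumptions, and the original table may well have been produced with tidyverse functions instead.

```r
# Hypothetical example titles, not taken from the dataset.
titles <- c("Networks: a primer", "Birds, bees: and trees")

# Split each title into individual characters and pool them.
chars <- unlist(strsplit(titles, ""))

# Tabulate into a data frame with columns `value` and `n`.
char_counts <- as.data.frame(table(value = chars),
                             responseName = "n",
                             stringsAsFactors = FALSE)

# Keep only non-alphanumeric characters (spaces, punctuation, and so on).
punct_counts <- char_counts[!grepl("[[:alnum:]]", char_counts$value), ]
print(punct_counts)
```

Counting characters instead of titles explains why ":" appears 1445 times in the table while only 1442 titles contain a colon: a few titles have more than one.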